We are happy to announce a new release of the WebDataCommons RDFa,
Microdata, and Microformat data sets.
The data sets have been extracted from the November 2013 version of the
Common Crawl, which covers 2.24 billion HTML pages originating from 12.8
million websites (pay-level-domains).
Altogether, we discovered structured data within 585 million HTML pages out
of the 2.24 billion pages contained in the crawl (26%). These pages
originate from 1.7 million different pay-level-domains out of the 12.8
million pay-level-domains covered by the crawl (13%).
Approximately 471 thousand of these websites use RDFa, while 463 thousand
websites use Microdata. Microformats are used on 1 million websites within
the crawl.
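
For readers who want to reproduce the percentages quoted above, the
following is a minimal sketch that recomputes them from the rounded figures
in this announcement. The variable and function names are illustrative
assumptions and are not part of the WebDataCommons tooling; because the
inputs are rounded, the results are only approximate.

```python
# Rounded figures taken from the announcement text (not from the raw data).
PAGES_TOTAL = 2.24e9        # HTML pages in the November 2013 crawl
PAGES_WITH_DATA = 585e6     # pages containing structured data
PLDS_TOTAL = 12.8e6         # pay-level-domains covered by the crawl
PLDS_WITH_DATA = 1.7e6      # pay-level-domains with structured data

# Websites (pay-level-domains) per markup format, as quoted above.
PLDS_BY_FORMAT = {
    "RDFa": 471e3,
    "Microdata": 463e3,
    "Microformats": 1e6,
}


def share(part: float, whole: float) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


page_share = share(PAGES_WITH_DATA, PAGES_TOTAL)
pld_share = share(PLDS_WITH_DATA, PLDS_TOTAL)

print(f"Pages with structured data: {page_share:.1f}%")
print(f"Pay-level-domains with structured data: {pld_share:.1f}%")

# Share of all crawled pay-level-domains using each format.
for name, count in PLDS_BY_FORMAT.items():
    print(f"{name}: {share(count, PLDS_TOTAL):.1f}% of all pay-level-domains")
```

Note that the per-format counts add up to more than 1.7 million, which is
consistent with some websites using more than one markup format.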